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ABSTRACT 

Motivation: Expression quantitative trait loci (eQTL) studies have 
discovered thousands of genetic variants that regulate gene 
expression and have enabled a better understanding of the functional 
role of non-coding sequences. However, eQTL studies are generally 
quite expensive, requiring large sample sizes and genome-wide 
genotyping of each sample. On the other hand, allele specific 
expression (ASE) is becoming a very popular approach to detect the 
effect of genetic variation on gene expression, even within a single 
individual. This is typically achieved by counting the number of RNA- 
seq reads matching each allele at heterozygous sites and rejecting 
the null hypothesis of 1 :1 allelic ratio. In principle, when genotype 
information is not readily available it could be inferred from the RNA- 
seq reads directly. However, there are currently no methods that 
jointly infer genotype and test for ASE or that include the uncertainty 
in the genotype calls within the ASE inference step. 
Results: Here, we present QuASAR, Quantitative Allele Specific 
Analysis of Reads, a novel statistical learning method for jointly 
detecting heterozygous genotypes and inferring ASE. The proposed 
ASE inference step takes into consideration the uncertainty in the 
genotype calls while including parameters that model base-call errors 
in sequencing and allelic over-dispersion. We validated our method 
with experimental data for which high quality genotypes are available. 
Results for an additional dataset with multiple replicates at different 
sequencing depths demonstrate that QuASAR is a powerful tool for 
ASE analysis when genotypes are not available. 

Availability: http : //github. com/piquelab/QuASAR 
Contact: f luca@wayne . edu; rpique@wayne . edu 



1 INTRODUCTION 

Quantitative trait loci (QTLs) for molecular cellular phenotypes 
(as defined by Dermitzakis, 2012), such as gene expression 
(eQTL) (e.g. Stranger et al., 2007), transcription factor (TF) 
binding (Kasowski et al, 2010), and DNase I sensitivity (Degner 
et al, 2012) have begun to provide a better understanding of how 
genetic variants in regulatory sequences can affect gene expression 
levels (see also Stranger et al, 2007; Gibbs et al, 2010; Melzer 
et al, 2008; Gieger et al., 2008). eQTL studies in particular have 
been successful in identifying genomic regions associated with gene 
expression in various tissues and conditions (e.g., Maranville et al, 
2011; Barreiro et al., 2012; Nica et al, 2011; Smirnov et al, 2009; 
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Dimas et al., 2009; Ding et al., 2010; Grundberg et ah, 2011; Lee 
et al, 2014; Fairfax et al., 2014). While previous studies have 
shown an enrichment for GWAS hits among regulatory variants 
in lymphoblastoid cell lines (LCLs) (Nica et al, 2010; Nicolae 
et al, 2010), a full understanding of the molecular mechanisms 
underlying GWAS hits requires a functional characterization of 
each variant in the tissue and environmental conditions relevant for 
the trait under study (e.g. estrogen level for genetic risk to breast 
cancer Cowper-SaHari et al., 2012). 

The ongoing GTEx project will significantly increase the number 
of surveyed tissues for which eQTL data are available and will 
represent a useful resource to functionally annotate genetic variants. 
However, the number of cell-types and environments explored will 
still be a very small fraction compared to a presumably large number 
of regulatory variants that mediate specific GxE interactions. eQTL 
studies are generally quite expensive, requiring large sample sizes 
(n > 70) which may be difficult to achieve for tissues that are 
obtained by surgical procedures or are difficult to culture in vitro. 
Even if biospecimens are readily available at no cost, eQTL studies 
require large amounts of experimental work to obtain genotypes 
and gene expression levels. As the measurement of gene expression 
using high-throughput sequencing (RNA-seq) is becoming more 
popular than microarrays, RNA-seq library preparation is also 
becoming less expensive ($46 per sample) while costs of sequencing 
are also very rapidly decreasing (for example, 16M reads per sample 
would cost $49 using a multiplexing strategy). Additionally, the 
sequence information provided by RNA-seq can be used to call 
genotypes (Shah et al, 2009; Duitama et al, 2012; Piskol et al., 
2013), detect and quantify isoforms (Trapnell et al., 2010; Katz 
et al., 2010) and to measure allele specific expression (ASE), if 
enough sequencing depth is available (Degner et al. , 2009; Pastinen, 
2010). 

ASE approaches currently represent the most effective way to 
assay the effect of a cis-regulatory variant within a defined cellular 
environment, while controlling for any trans-acting modifiers of 
gene expression, such as the genotype at other loci (Pastinen, 2010; 
Kasowski et al., 2010; McDaniell et al, 2010; Cowper-SaHari 
et al, 2012; Reddy et al, 2012; Hasin-Brumshtein et al, 2014; 
McVicker et al, 2013). As such, ASE studies have greater statistical 
power to detect genetic effects in cis and can be performed using a 
smaller sample size than a traditional eQTL mapping approach. 

In the absence of ASE, the two alleles for a heterozygous 
genotype at a single nucleotide polymorphism (SNP) in a gene 
transcript are represented in a 1:1 ratio in RNA-seq reads. To 
reject the null hypothesis and infer ASE it is necessary to identify 
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Fig. 1. Reference allele frequency from reads overlapping SNPs. (Left) Each dot represents a SNP covered by at least 15 RNA-seq reads. The y-axis 
represents the fraction of RNA-seq reads that match the reference allele (observed pi). The x-axis represents the order of the SNP position in a chromosome. 
(Right) Histogram showing the distribution of pi values across the genome. The three modes (p G {1, 0.5, 0}) correspond respectively to the three possible 
genotypes: homozygous reference (RR), heterozygous under no ASE (RA), and homozygous alternate (AA). 



heterozygous SNPs with high confidence and a significant departure 
of the 50% allelic ratio. While genotyping and ASE are usually 
considered two separate problems, miscalling an homozygous SNP 
as heterozygous will also likely induce an error in rejecting the ASE 
null hypothesis; thus, we argue that the two problems should be 
solved together. 

While it is possible to obtain genotype information from RNA- 
seq (Shah et al., 2009; Duitama et al, 2012; Piskol et al, 2013), 
to the best of our knowledge, all existing methods for detecting 
ASE consider that the genotypes are known and usually the error 
probabilities associated with genotyping are not taken into account 
for the ASE step. While overall genotyping quality can also be 
modeled within the ASE model (McVicker et al., 2013), there is 
currently no method that for each SNP can jointly genotype and 
analyze ASE. Here, we propose a novel framework for quantitative 
allele specific analysis of reads (QuASAR) that starts from a single 
or multiple RNA-seq experiments from the same individual and can 
directly identify heterozygous SNPs and assess ASE accurately by 
taking into account base-calling errors and overdispersion in the 
ASE ratio. QuASAR is then evaluated in two different datasets that 
demonstrate the accuracy and the importance of incorporating the 
genotype uncertainty in determining ASE. 



2 APPROACH 

QuASAR starts with experimental high-throughput sequencing 
data. Here we focus on RNA-seq, but the same or similar pipeline 
can be applied to DNase-seq, ChlP-seq, ATAC-seq or other types 
of functional genomics library preparation. Figure 1 illustrates 
the underlying problem: detecting SNPs covered by reads with 



high allelic imbalance for which homozygosity (in the presence of 
base-calling errors) can also be rejected. 

We focus our attention on sites that are known to be variable in 
human populations, specifically we consider all SNPs from the 1000 
Genomes project (1KG) with a minor allele frequency MAF > 0.02. 
We index each SNP with I G {1, ...,£}, and each sample by 
s G {1, ...,£}. Samples are all from the same individual and 
may represent different experimental conditions or replicates. We 
only consider SNPs represented in at least 15 reads across all the 
samples. At each site I, three alternative genotypes are possible 
gi G {0,1,2} being homozygous reference (RR), heterozygous 
(RA), or homozygous alternate (AA) respectively. For each sample 
s and site I, N s j represents the total number of reads and r s ; = 
{r s ik}^l\ take the value 1 if read k matches the reference allele, 
and 0 if it matches the alternate allele. We can then model the data 
V — {{r^jf^j^ljas a mixture model 

S L 

»=i '=i 9ie{o,i,a} 

where Pr (gi) represents the prior probability associated with each 
genotype. Pr (r s i\gi) depends on the genotype, for Gi — 0: 

Pr{r s> i\g, =0;e s ) = ]J (1 - e s ) r ^ el^* 1 " (2) 
fe=i 

where we will only observe reads matching the alternate allele if 
those are base-calling errors, here modeled by the parameter e 3 . 
Conversely, for Gi = 2 we have the following: 

Pr(r Sii | 9i =2;e s ) = n(l-^) 1 " rsit ^ fc (?) 
fc=i 
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If the genotype is heterozygous Gi = 1, we may preferentially 
observe the reference allele with probability pi (or the alternate 
allele with probability 1 — pi). This results in the following model: 

N*l 

Pr (r s ,i\gt = 1; e.) = JJ ((1 - P i)(l - e s ) + Pl e s f- r ^ ■ 

k=l 

(pi(l-e s ) + (l-pi)e s y^ 

(4) 

These expressions can be simplified by considering that R s i = 
12kLi r sik and A s i — N a i — R„i are respectively the number of 
reads from sample s observed at site I matching the reference allele 
and the alternate allele: 

Pr (r sl \ gi = 0; e s ) (1, e s ) R *> [1 - V (1, ^)} Asl (5) 
Pr(r s! | ff! =2;e s )=^(0,e s ) fl - ! [1 - V (0, e s )] A ^' (6) 
Pr {r s i\gi = 1; e 3 , p si ) =ip {p s i,e s ) Rsl [1 - ip (p s i, t s )] Asl (7) 

where ip (p, e) = [p(l — e) + (1 — p)e] and makes explicit that 
gi = 2 (or gi — 0) is indistinguishable from pi s — 0 (or pi s = 1) 
when gi = 1. In QuASAR we resolve this identifiably problem by 
assuming that those cases with extreme ASE imbalance across all 
replicates are more likely to be homozygous genotypes. 

To fit the mixture model we use an EM algorithm (see Methods 
for more details) in which we estimate sample specific base-calling 
error rates e s (p is fixed to 0.5) and we are also able to provide a 
posterior probability for the genotype. For the ASE inference step, 
we wish to reject the null hypothesis p 3 i — 0.5. We additionally 
consider that ij) in (5-7) is a random variable ^ s i sampled from a 
~Beta(a s ;, /3 s i) distribution with: 

a 3 i = tp s iM 3 /3si = (1 - i(> a i) M 3 ij) s i = ip(p 3 i,eu) (8) 

where M a hyper parameter controls for over-dispersion and results 
in a better calibrated test as shown in the Results section. The 
resulting distribution on the number of reads after combining (7) 
and (8) is known as Beta-binomial distribution (13). Finally, the 
inference step takes into account the over-dispersion and genotype 
uncertainty, and can be formalized as a likelihood ratio test (LRT): 

sup psig{0 0 5 1} {Pr (r s i\p 3 i,e 3 ,M 3 jj 

Asl = r — 7 r n — (9) 

sup Psi jPr [r 3 i\p 3 i,e 3 ,M 3 J J 

where the set of parameters {e a , M s }f =1 are maximum likelihood 
estimates (MLE) under the null hypothesis pi s — 0.5 (see 
Methods section). To calculate a p- value we use the property that 
—2 log(A s ;) is asymptotically distributed as %i- 



3 METHODS 

3.1 Experimental data 

Lymphoblastoid cell lines (LCLs: GM18507 and GM18508) were purchased 
from Coriell Cell Repository and human umbilical vein endothelial cells 
(HUVECs) from Lonza. LCLs were cultured and starved according to 
(Maranville et at, 201 1). Cryopreserved HUVECs were thawed and cultured 
according to the manufacturer protocol (Lonza), with the exception that 



48 hour prior to collection the medium was changed to a starvation 
medium, composed of phenol-red free EGM-2, without Hydrocortisone and 
supplemented with 2% charcoal stripped-FBS. Cells were washed 2X using 
ice cold PBS, lysed on the plate, using Lysis/Binding Buffer (Ambion), and 
frozen at -80C. mRNA was isolated using the Ambion Dynabeads mRNA 
Direct kit (Life Technologies). We then prepared libraries for Illumina 
sequencing using a modified version of the NEBNext Ultra Directional 
RNA-seq Library preparation kit (New England Biolabs). Briefly, each 
mRNA sample was fragmented (300 nt) and converted to double-stranded 
cDNA, onto which we ligated barcoded DNA adapters (NEXTflex-96 
RNA-seq Barcodes, BIOO Scientific). Double-sided SPRI size selection 
(SPRISelect Beads, Beckman Coulter) was then performed to select 350-500 
bp fragments. The libraries were then amplified by PCR and pooled together 
for sequencing on the Illumina HiSeq 2500 at the University of Chicago 
Genomics Core. 

For each LCL sample, libraries from 9 replicates were pooled for a total 
of 42.3M and 34.9M 50bp PE reads, respectively. A total of 18 replicates 
across 6 time points were performed for the HUVEC samples (267M reads 
total), to capture a wide range of basal physiologic conditions. 

3.2 Pre-processing 

To create a core set of SNPs for ASE analysis, we first removed rare (MAF 
< 2%) variants from all 1KG SNPs. We also removed SNPs within 25 
bases up- or downstream of another SNP or short InDels as well as those 
SNPs in regions of annotated copy numbers or other blacklisted regions 
(Degner et at, 2012). Sequencing reads were aligned to the reference 
human genome hgl9 using bwa mem (Li and Durbin, 2009 http:// 
bio-bwa.sourceforge.net). Reads with quality <10 and duplicate 
reads were removed using samtools rmdup (http : / / github . com/ 
samtools/). Using a mappability filter (Degner et al. , 20 1 2), we removed 
reads that do not map uniquely to the reference genome and to alternate 
genome sequences built considering all 1KG variants. Aligned reads were 
then piled up on these SNPs using samtools mpileup command. Reads 
with a SNP at the beginning or at the end of the read were also removed to 
avoid any potential experimental bias. Finally, the pileups were re-formatted 
so that each SNP has a count for reads containing the reference allele, and a 
count for those containing the alternate allele. 

3.3 Model fitting and parameter estimation 

In order to use the expectation maximization (EM) technique (McLachlan 
and Krishnan, 2007), we first convert (1) to a "complete" likelihood, as if we 
knew the underlying genotypes Q = {Gjj^Lj^: 

L{&) = Pr (T>, g\0) = Pr (V\Q; 6) Pr (g|0) = (10) 
L 2 r S 

= n n i pr ( G < = 9) n = 9-,e s , Psl ) \ 

where = l(Gi = g) are binary indicator variables and © represents the 
set of all parameters of the model. In log-likelihood form we have: 

L 2 S L 

i(e) = io g L(B) = J2J2 G f HP?(Gi =g)) + J2J2^ < n > 

i = l a =0 s=l 1=1 

G° [R sl In </>(!, e») + A sl ln(l - e 8 ))] 
+G] [R s i In i>{p sl ,e s )+ A sl ln(l - ^(p al ,e s ))] 

+Gf [R sl lnV>(0, e s ) + A st ln(l - </>(0, e s ))]} 

During the genotyping step, in order to maintain identifiability of the 
model, we fix p a i = 0.5 for all loci. Although M s could also be estimated 
within the EM procedure, we only consider overdispersion on the ASE step. 
These two choices lead to a much simpler EM procedure and a slightly 
conservative estimate of e s . 

E-Step: From the complete likelihood function (11) we derive 
the expected values for the unknown genotype indicator variables 
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E(Gf |X>, 0) = (G?) given the observed data and the current estimates 
for the model parameters. We are also interested in (Gf) = Pr(Gf = 
g\V, 0), because they are the posterior probability of each genotype given 
the data: 



C=((G») + (G?) + (G?))- 1 
(G?) =CPr(G ; =0)exp (^[i? is ln(l - e s ) + A la ln(e,)] 



(12) 



(Gj) = GPr(G ; = l)cxp ln(0.5)^)[fl Is + A 



(Gf) =GPr(G ; = 2)exp 



J2[R ls ln(e s ) + A la ln(l 
s=l 



The prior genotype probabilities Pr(G; = g) are obtained from the 1KG 
allele frequencies assuming Hardy-Weinberg equilibrium, but the user can 
change this. 

M-Step: Using the expected values from the E-step, the complete 
likelihood is now a function of e a that is easily maximized 

Ef =1 «G?)A s! 



logit 



In : 



Ef =1 ((G°)R al + (Gf)A al ) 

After we run QuASAR to infer genotypes across samples from the same 
individual, for each site we have a posterior probability for each genotype 
(Gf ), and a base-calling error e a estimated for each sample. From these 
posteriors, discrete genotypes are called by using the genotype with the 
highest posterior probability; the maximum a posteriori (MAP) estimate. 

ASE-Inference: To detect ASE we only consider sites with an 
heterozygous MAP higher than a given threshold (e.g., (Gj) > 0.99). We 
then test the possibility that p 3 i is different than 0.5 while also taking into 
account overdispersion using a beta-binomial model (by combining (7) and 
(8)): 

Pr(fl si |/V si ,^,M s ) = 

= / N s j\ T (M s ) r (Rsi + fe;M a ) T (A sl + (1 - -4> a i) M s ) 
\R sl ) r(Nsi + Ms)T(iPsiMs)r((l-->Psi)M s ) 

where M s controls the effective number of samples supporting the prior 
belief that p = .5 and is estimated using grid search: 

Ms = arg max ( ]J Pr (R sl \N al , e a , p a i = 0.5, M s )\ (.14] 



1=1 



Then, we estimate p s l using (13) and M s from (14) using a standard gradient 
method (L-BFGS-B) to maximize the following log-likelihood function 



l(Psi;M s ,e s ) = Pr(R al \Nsi,ipsl = i>(p a i, e a ), M a 



(15) 



Finally, all these parameters are used to calculate the LRT in (9) to get a 
p- value. 

Additionally, we can provide an estimate of the standard error associated 
with the parameter p a i using the second derivative of the log-likelihood 
function (15): 



'Psl 



d 2 

—^-l(p s i;M a ,e a ) 
9 Psl 



(16) 

Psl=Psl 

alternatively we can also recover a standard error from (9) (as is 
asymptotically distributed as Xrff=i)> allowing the p- value to be used to 
back solve for the standard error: 



Pis 



(17) 



where Q() is the quantile function for a standard normal distribution and pi a 
is the p-value from (9). We use the first form (16) when p ~ 0.5 and (17) 



otherwise as they give a better approximation at those ranges, respectively. 
Alternatively, if we do not need o'p sl we can use (15) to obtain a profile 
likelihood confidence interval for p s ;. 

4 RESULTS 

We implemented the QuASAR approach as detailed in the Methods 
section in an R package available at http://github.com/ 
piquelab/QuASAR. We first sought to evaluate QuASAR 
genotyping accuracy using RNA-seq reads obtained from two 
lymphoblastoid cell-lines (LCLs) that already have very high quality 
genotypes calls from the 1KG project (GM18507 and GM18508). 
As illustrated in Table 1, we are able to accurately genotype 
thousands of loci, and genotyping error rates decrease with an 
increase of the MAP threshold. The method by construction is 
more conservative in making heterozygous calls. In case of doubt, 
between a heterozygous genotype with extreme allelic imbalance, or 
homozygous genotype with base call errors, our model would lean 
in favor of the homozygote call. This is a crucial feature for accurate 
inference of ASE as we will discuss in more detail in this section. 



Sample 


QuASAR 
Performance 


MAP (Gf ) 


>0.5 


>0.9 


>0.99 


NA 18507 


Heterozygotes 
False Discoveries 


1706 
1.00% 


1704 
1.00% 


1702 

0.94% 


Homozygotes 
False Discoveries 


5510 
4.83% 


5509 
4.83% 


5506 
4.83% 


NA 18508 


Heterozygotes 
False Discoveries 


1466 
0.41% 


1466 
0.41% 


1465 

0.34% 


Homozygotes 
False Discoveries 


4641 
5.11% 


4638 
5.07% 


4634 
5.03% 



Table 1. QuASAR accuracy in genotyping. Each row reports the number 
of heterozygous and homozygous SNPs identified by QuASAR and the 
percentage of false discoveries when compared to 1KG genotypes. Each 
column uses a different MAP threshold to define high confidence genotype 
calls. 



We next sought to characterize QuASAR performance in 
genotyping and in calling ASE from RNA-seq experiments 
sequenced at different depths. In total we analyzed 18 samples (3 
replicates across 6 different time-points) for an individual for which 
genotypes were not previously available. We combined the 18 fastq 
files in different ways as input for QuASAR, to obtain an empirical 
power curve (see Figure 2). As expected, we observed that our 
ability to detect heterozygotes (MAP> 0.99) increases with the 
sequencing depth. The number of heterozygous sites detected seems 
to start to plateau at 10, 000 when the total number of RNAs-seq 
reads exceeds 150 million. At a more modest sequencing depth of 
16 million, we can still detect more than 1,000 heterozygous sites. 

After obtaining the genotypes, we then assessed whether there is 
evidence of ASE on any of the SNPs determined to be heterozygous. 
Using the LRT (9) we determined ASE and obtained p-values. We 
controlled the FDR using the g-value procedure (Storey, 2002). As 
shown in Figure 3, our ability to detect ASE greatly increases with 
the number of SNPs we are able to genotype, which in turn is a 
function of coverage (Figure 2). Using 100 million reads we achieve 
detecting about 50 SNPs with ASE at 10%FDR. 

In order to determine if the ASE inference in QuASAR is 
well calibrated and to compare QuASAR to other ASE tests we 
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Fig. 2. Empirical power in detecting heterozygous SNPs as a function of 
sequencing depth. Each point represents a single input dataset to QuASAR: 
either as a single experiment replicate and time point (red dot), combining 
multiple time points (2 = green, 3 = blue, 6 = purple), or combining replicates 
(1 = dot, 2 = triangle). The x-axis represents the log 10 of the total number 
of RNA-seq reads in the fastq input files. The y-axis represents the log 10 of 
the total number of SNPs that are determined to be heterozygous. 



Fig. 3. Empirical power in detecting ASE as a function of the number of 
heterozygous SNPs detected. Each point represents a single input dataset to 
QuASAR as in Figure 2. The rr-axis represents the log 10 of the total number 
of SNPs that are determined to be heterozygous. The y-axis represents the 
log 10 of the number of SNPs that have a significant p- value for ASE at 10% 
FDR. 



used QQ-plots, see Figure 4. A QQ-plot compares the quantiles 
observed from a test statistic to those that are expected under a 
null distribution (e.g., p-values are uniformly distributed between 
0 and 1). The shape of the QQ-plot curve is useful to judge how 
well the p- values are calibrated when we expect that a large number 
of the tests conducted are sampled from the null distribution. In this 
latter scenario, we expect that the QQ-plot curve would follow the 
y — x line for the bulk of the higher range of p-values. For small 
p- values, we expect that the curve starts to depart from the 1:1 line 
representing the small proportion of tests that are not sampled from 
the null distribution. Figure 4 clearly shows that the Binomial test 
is too optimistic, and will likely lead to many false discoveries. The 
Beta-binomial model is well calibrated, but if we are not certain 
about the genotype being a true heterozygote it can lead to very 
small p-values that are false positives. In QuASAR, we use a Beta- 
binomial model and we also consider uncertainty on the genotype, 
which results in the most conservative of the three different tests, 
while likely avoiding the most common causes of false positives in 
ASE analysis. 



5 DISCUSSION 

QuASAR is the first approach that detects genotypes and infers 
ASE from the same sequencing data. In this work, we focused 
on RNA-seq, but QuASAR can be applied to other data types 
(ChlP-seq, DNase-seq, ATAC-seq, and others). Indeed, the more 
experimental data we have from the same underlying individual 



across many experimental samples (data-types, conditions, cell- 
types or technical replicates), the more certainty we can gather about 
the genotype for any given site. The algorithm is computationally 
very fast, each EM iteration is O(LS) linear with the number of 
SNPs and samples and convergence is achieved in about 10 or less 
iterations. 

A key aspect of QuASAR ASE inference step is that it takes into 
account over-dispersion and genotype uncertainty resulting in a test 
that we have shown here to be well-calibrated. In many cases, the p- 
values obtained from biased statistics can be recalibrated to the true 
null distribution using a permutation procedure. Unfortunately, this 
is not possible for ASE inference, as randomly permuting the reads 
assigned to each allele would inadvertently assume that the reads are 
distributed according to a Binomial distribution. More complicated 
and computationally costly resampling procedures can be proposed, 
but it is not clear which additional assumptions may introduce and 
if they can correctly take into account genotyping uncertainty. 

If prior genotype information is available, it can also be provided 
as input to the algorithm. The prior uncertainty of the genotypes 
should be reflected in the form of prior probabilities for each 
genotype. In this paper, we have shown that we can obtain reliable 
genotype information from RNA-seq reads, thus making additional 
genotyping unnecessary if the endpoint is to detect ASE. Instead, 
sequencing the RNA-seq libraries at a higher depth is probably 
a better strategy as it greatly improves the power to detect ASE 
signals. 

Furthermore, as sequencing costs are decreasing very rapidly, 
ASE methods are becoming very attractive in applications where 
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15 - 

Binomial > 




Expected -\og m (p-value) 



Fig. 4. QQplot comparing the p-value distribution of 3 alternative 
methods for determining ASE. The rr-axis shows the log 10 quantiles of 
the p- values expected from the null distribution. The y-axis shows the log 10 
quantiles of the p-values computed from the real data using 3 different 
methods: i) Binomial (black) assumes M = oo no overdispersion; ii) Beta- 
binomial (blue) considers overdispersion but does not consider uncertainty 
in the genotype; iii) QuASAR uses the Beta-binomial distribution and 
uncertainty in the genotype calls. In all three cases the same set of SNPs 
are considered 

eQTL studies have been previously used. This is even more 
important in scenarios where collecting large number of samples 
is expensive or infeasible. Large scale eQTL studies are still 
very much necessary for fine-mapping, but allele specific analysis 
methods can provide unique insights into mechanisms that are 
uncovered only under specific experimental conditions, for example 
as a result of gene x environment interactions. 
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